Although most of our data is categorical, we still want to look at the correlation between the final grade and all other variables.
Result: in general, the dependence between the variables and the final grade is weak. Keep in mind, however, that all variables are categorical except failures and absences (and failures is not fully continuous). We also highlighted the alcohol consumption variables to show that there is no clear linear dependence.
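For mostly ordinal variables, a rank-based coefficient is a reasonable way to quantify such a dependence. The sketch below is only an illustration: the text above does not say which coefficient was used, so Spearman's rho is an assumption here, and the data frame `toy` contains made-up values that merely reuse the column names from the report.

```r
# Hypothetical toy data: values are invented, only the column names mirror the report.
toy <- data.frame(
  G3       = c(15, 8, 12, 10, 17, 6, 13, 11),
  failures = c(0, 2, 0, 1, 0, 3, 0, 1),
  absences = c(2, 10, 4, 6, 0, 14, 3, 8),
  Dalc     = c(1, 3, 1, 2, 1, 5, 2, 2)
)

# Spearman rank correlation of the final grade with every other column.
predictors <- setdiff(names(toy), "G3")
rho <- sapply(predictors, function(v) cor(toy$G3, toy[[v]], method = "spearman"))

print(round(rho, 2))
```

Spearman's rho only relies on the ordering of the values, which is why it is often preferred over Pearson's coefficient for ordinal scales.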
Closer look at dependence between Grades and alcohol consumption
Before diving deeper, we give a simple overview of the grades to show how many students fall into each success group.
We consider students successful only if their grade is between “C” and “A”, and unsuccessful if their grade is “F” or “D”. As the Portuguese educational system uses numeric marks from 0 to 20, we needed to transform them into the five-level classification system (Table 2).
Table 2: The five-level classification system
| Country | Excellent/very good | Good | Satisfactory | Sufficient | Fail |
|---|---|---|---|---|---|
| Portugal/France | 16-20 | 14-15 | 12-13 | 10-11 | 0-9 |
| Ireland/USA | A | B | C | D | F |
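The mapping in Table 2 can be sketched with base R's `cut()`. The helper name `to_letter_grade` is hypothetical and not part of the original report, and the sketch assumes integer marks between 0 and 20.

```r
# Hypothetical helper: map 0-20 marks to the letter grades of Table 2.
# Intervals are right-closed: (-Inf, 9], (9, 11], (11, 13], (13, 15], (15, 20].
to_letter_grade <- function(mark) {
  cut(mark,
      breaks = c(-Inf, 9, 11, 13, 15, 20),
      labels = c("F", "D", "C", "B", "A"),
      right = TRUE,
      ordered_result = TRUE)
}

marks  <- c(0, 9, 10, 11, 12, 14, 15, 16, 20)
grades <- to_letter_grade(marks)

# Binary success flag as defined in the text: "F" and "D" count as unsuccessful.
success <- ifelse(grades %in% c("F", "D"), "No", "Yes")

print(data.frame(marks, grades, success))
```

Returning an ordered factor keeps the grades in the order F < D < C < B < A, so plots do not fall back to alphabetical order.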
{
  library(dplyr)
  library(ggplot2)
  library(gridExtra)

  # full_df and matlab.colors are assumed to be defined earlier in the report.
  # Students with "F" or "D" are marked as unsuccessful, all others as successful.
  ddf <- full_df %>%
    mutate(
      success = ifelse(G3_d == 'F' | G3_d == 'D', 'No', 'Yes'),
      G3_d = factor(G3_d, levels = c('F', 'D', 'C', 'B', 'A'))
    )

  # Left panel: number of students per letter grade.
  p1 <- ddf %>%
    ggplot(aes(x = G3_d, fill = success)) +
    geom_bar(position = "dodge") +
    theme_bw() +
    theme(legend.position = "none") +
    scale_fill_manual(values = matlab.colors[2:1]) +
    labs(y = "Number of students",
         x = "Final Grade (Discrete)",
         title = "Final Grade Discrete Distribution") +
    theme(plot.title = element_text(hjust = 0.5))

  # Right panel: number of students per success group.
  p2 <- ddf %>%
    ggplot(aes(x = success, fill = success)) +
    geom_bar(position = "dodge") +
    scale_fill_manual(values = matlab.colors[2:1]) +
    theme_bw() +
    scale_x_discrete(breaks = c("No", "Yes"),
                     labels = c("Low", "High")) +
    labs(y = "Number of students",
         x = "Final Grade (Binary)",
         title = "Final Grade Binary Distribution") +
    theme(plot.title = element_text(hjust = 0.5))

  grid.arrange(p1, p2, nrow = 1)
}
As you can see, the number of students in both groups is almost the same.
Let us look directly at the dependence between grades and alcohol consumption.
{
  # s_class encodes four groups:
  #   1 = low grade,  low alcohol     2 = decent grade, low alcohol
  #   3 = low grade,  high alcohol    4 = decent grade, high alcohol
  class_df = full_df %>%
    mutate(
      s_class = ifelse(G3 < 12, 1, 2),
      s_class = s_class + ifelse(as.numeric(Salc) < 5, 0, 2),
      G3_d = factor(G3_d, levels = c('F', 'D', 'C', 'B', 'A'))
    )

  # Jittered scatter of letter grade against alcohol consumption, coloured by group.
  class_df %>%
    ggplot(aes(x = Salc, y = G3_d)) +
    geom_jitter(aes(colour = as.factor(s_class))) +
    scale_color_manual(values = matlab.colors[c(2, 3, 4, 1)]) +
    theme_bw() +
    theme(legend.position = "none") +
    labs(y = "Final Grade (Discrete)",
         x = "Alcohol consumption",
         title = "Final period grade") +
    theme(plot.title = element_text(hjust = 0.5))
}
Result: decent grades are much less frequent among students with high alcohol consumption. At the same time, low alcohol consumption by itself does not go hand in hand with decent grades. The last point can be seen more clearly in the following figure.
library(ggmosaic)  # provides geom_mosaic() and product()

class_df %>%
  mutate(
    # Odd s_class values (1, 3) are the low-grade groups.
    failure = ifelse(s_class %% 2 == 1, 'High', 'Low'),
    # s_class values 3 and 4 are the high-alcohol groups
    # (the original condition `s_class > 3` missed group 3).
    addiction = ifelse(s_class > 2, 'High', 'Low'),
    failure = factor(failure, levels = c('Low', 'High')),
    addiction = factor(addiction, levels = c('Low', 'High'))
  ) %>%
  ggplot() +
  geom_mosaic(aes(x = product(failure, addiction), fill = failure), na.rm = TRUE) +
  scale_fill_manual(values = matlab.colors[2:1]) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "The amount of alcohol consumption",
       y = "Final Grade (Binary)",
       title = "Final period grade") +
  theme(plot.title = element_text(hjust = 0.5))
Result: you can clearly see that among students with low alcohol consumption, the number of successful students is approximately the same as the number of less successful students.
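The shares that a mosaic plot displays can also be read off a contingency table. The following is a minimal sketch on made-up data: `toy` is hypothetical and only reuses the `addiction` and `success` labels from the code above.

```r
# Hypothetical toy data: four students per alcohol group, values are invented.
toy <- data.frame(
  addiction = c("Low", "Low", "Low", "Low", "High", "High", "High", "High"),
  success   = c("Yes", "No",  "Yes", "No",  "No",   "No",   "No",   "Yes")
)

# Counts per (addiction, success) cell, then row-wise shares within each alcohol group.
counts <- table(toy$addiction, toy$success)
shares <- prop.table(counts, margin = 1)

print(shares)
```

With `margin = 1` every row sums to one, so each row gives the split between successful and less successful students within one alcohol group.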
Dependence between grades and variables such as gender and age
How do age and gender affect success?
{
  # geom_split_violin() and colours_5 are assumed to be defined earlier in the report.

  # Horizontal extent of the background bands: half a year beyond the youngest and oldest student.
  age_min = (full_df$age - 0.5) %>% min() %>% as.factor()
  age_max = (full_df$age + 0.5) %>% max() %>% as.factor()
  age_min = rep(age_min, 5)
  age_max = rep(age_max, 5)

  # Vertical limits of the five grade bands (F, D, C, B, A) on the 0-20 scale.
  y_min = c(0, 10, 12, 14, 16)
  y_max = c(y_min[2:5], 20)
  common_alpha = rep(0.4, 5)

  # Number of students per letter grade; the rows are expected in the order A, B, C, D, F.
  nstudents = full_df %>%
    group_by(G3_d) %>%
    summarise(n = n())

  # Band labels, placed at the vertical centre of each band.
  y_text = (y_min + y_max) / 2
  grades = c('F', 'D', 'C', 'B', 'A')
  l_text = rep('', 5)
  for (i in 1:5) l_text[i] = glue::glue("{grades[i]}-students\n area ({nstudents$n[6-i]})")

  full_df %>%
    ggplot() +
    annotate("rect",
             xmin = age_min, xmax = age_max,
             ymin = y_min, ymax = y_max,
             fill = colours_5[5:1],
             alpha = common_alpha
    ) +
    # One half-violin per gender for each age, with the median marked.
    geom_split_violin(aes(x = age %>% as.factor(), y = G3, fill = gender),
                      draw_quantiles = c(0.5)) +
    scale_fill_manual(values = alpha(c('#D3D3D3', '#808080'), 0.8)) +
    theme_bw() +
    annotate("text", x = "22", y = y_text, label = l_text) +
    labs(x = "Age",
         y = "Final grade",
         title = "Comparison of male and female grades by age") +
    theme(plot.title = element_text(hjust = 0.5))
}
Result: the majority of students are in the C and D areas. There is no visible difference between males and females up to the age of 18. After 18 the distributions are completely different, and we do not have a solid explanation for that.